Nstep Experience Replay

1 Overview

To reduce the fluctuation caused by random sampling, especially during the bootstrap phase, the N-step reward (a discounted sum of rewards) is useful. By expanding the Bellman equation, the N-step target of the Q function becomes \(\sum _{k=0}^{N-1} \gamma ^k r_{t+k} + \gamma ^N \max _{a} Q(s_{t+N},a)\).
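
As a concrete illustration (the numbers are chosen only for this example), take \(N = 3\), \(\gamma = 0.9\), and \(r_t = r_{t+1} = r_{t+2} = 1\):

gamma = 0.9
rewards = [1.0, 1.0, 1.0]   # r_t, r_{t+1}, r_{t+2}
nstep_return = sum(gamma**k * r for k, r in enumerate(rewards))  # 1 + 0.9 + 0.81 = 2.71
# The full target adds the bootstrapped tail: 2.71 + gamma**3 * max_a Q(s_{t+3}, a)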

According to W. Fedus et al., the N-step reward can utilize a larger buffer more effectively. Even though the N-step reward, which is computed under the behavior policy used during exploration, is theoretically not justified for off-policy learning, it still works better in practice.

You can create an N-step version of the replay buffer by specifying the Nstep parameter in the constructor of ReplayBuffer or PrioritizedReplayBuffer. Without any modification to your environment, cpprb sums up the N-step rewards and shifts the "next" values such as next_obs.

The Nstep parameter is a dict with the keys "size", "rew", "gamma", and "next". Nstep["size"] is the N-step size; a size of 1 is identical to an ordinary replay buffer (but less efficient). Nstep["rew"], whose type is str or array-like of str, specifies the reward(s) to be summed. Nstep["gamma"] is the discount factor for the reward summation. Nstep["next"], whose type is str or array-like of str, specifies the "next"-type value(s); for these, the sample method returns the (i+N)-th value instead of the (i+1)-th one.

sample also replaces "done" with its N-step version.

cpprb v10 no longer returns \(\gamma ^{N-1}\), since users can always multiply by the fixed \(\gamma ^N\) themselves.

Since the N-step buffer temporarily stores values in local storage, you need to call the on_episode_end member function at the end of every episode to flush them into the main storage properly.

Parameters | Type                     | Description
-----------|--------------------------|----------------------------
size       | int                      | N-step size
rew        | str or array-like of str | N-step reward
gamma      | float                    | Discount factor
next       | str or array-like of str | Next items (e.g. next_obs)

Return Value               | Replace: From \(\to\) To
---------------------------|--------------------------------------------------
Next items (e.g. next_obs) | \(s_{t+1} \to s_{t+N}\)
rew                        | \(r_t \to \sum _{n=0}^{N-1} \gamma ^n r_{t+n}\)
done                       | \(d_t \to 1-\prod _{n=0}^{N-1} (1-d_{t+n})\)
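
For intuition, the following is a minimal sketch of these replacement rules; it is not part of the cpprb API, the per-episode lists rews, dones, and next_obss and the helper nstep_transform are hypothetical, and all N transitions are assumed to lie within one episode.

def nstep_transform(i, rews, dones, next_obss, N=4, gamma=0.99):
    """Sketch of the N-step replacements summarized in the table above.

    next_obss[j] is assumed to hold s_{j+1} for transition j.
    """
    rew_n = 0.0
    done_n = 0.0
    for k in range(N):
        rew_n += (gamma ** k) * rews[i + k]                   # sum_n gamma^n r_{t+n}
        done_n = 1.0 - (1.0 - done_n) * (1.0 - dones[i + k])  # 1 - prod_n (1 - d_{t+n})
    next_obs_n = next_obss[i + N - 1]                         # s_{t+N}
    return rew_n, done_n, next_obs_n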

2 Example Usage

import numpy as np
from cpprb import ReplayBuffer

nstep = 4
gamma = 0.99
discounts = gamma ** nstep  # fixed gamma^N, multiplied by the user when forming the target

rb = ReplayBuffer(32,{'obs': {"shape": (4,4)},
                      'act': {"shape": 3},
                      'rew': {},
                      'next_obs': {"shape": (4,4)},
                      'done': {}},
                  Nstep={"size": nstep,
                         "gamma": gamma,
                         "rew": "rew",
                         "next": "next_obs"})

for i in range(100):
    done = 1.0 if i%10 == 9 else 0.0
    rb.add(obs=np.zeros((4,4)),
           act=np.ones((3)),
           rew=1.0,
           next_obs=np.zeros((4,4)),
           done=done)
    if done:
        rb.on_episode_end()  # flush the N-step local storage into the main buffer
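
def Q(next_obs):
    # Placeholder Q-network (an assumption for this example, not part of cpprb):
    # returns estimated action values with shape (batch_size, n_actions).
    return np.zeros((next_obs.shape[0], 3))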

sample = rb.sample(16)

# N-step TD target: sample["rew"] is already the discounted N-step sum and
# sample["next_obs"] holds s_{t+N}, so the bootstrap term uses gamma**N.
nstep_target = sample["rew"] + (1-sample["done"]) * discounts * Q(sample["next_obs"]).max(axis=1)

3 Notes

This N-step feature assumes that consecutive transitions of a trajectory (episode) are stored sequentially. If you use a distributed-agent configuration, you must add each episode as one contiguous block, as sketched below.
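
For instance, a simple way to keep each trajectory contiguous when several actors feed one shared buffer is to collect the episode locally and add it in one batched call (assuming, as with the non-N-step buffer, that add accepts stacked arrays along the leading axis); the episode arrays below are hypothetical and the buffer definition repeats the example above.

import numpy as np
from cpprb import ReplayBuffer

rb = ReplayBuffer(32,{'obs': {"shape": (4,4)},
                      'act': {"shape": 3},
                      'rew': {},
                      'next_obs': {"shape": (4,4)},
                      'done': {}},
                  Nstep={"size": 4, "gamma": 0.99,
                         "rew": "rew", "next": "next_obs"})

T = 10  # hypothetical episode length collected by a single actor
episode = {"obs": np.zeros((T,4,4)),
           "act": np.ones((T,3)),
           "rew": np.ones(T),
           "next_obs": np.zeros((T,4,4)),
           "done": np.concatenate([np.zeros(T-1), np.ones(1)])}

rb.add(**episode)      # add the whole trajectory at once so it stays contiguous
rb.on_episode_end()    # then flush the N-step local storage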

4 Technical Detail